In this document I prototype the data analysis and compound identification workflow on a simple dataset. The dataset consists of 6 samples, two replicates of each of the following sample types:
XCMS processing of this dataset has already been done in both polarities. The only difference in the processing is that for NEG polarity I retained the features that contain at least 5 peaks with intensity >= 10^{5} counts, whereas for POS polarity the intensity threshold was 2.5^{5} counts.
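As a minimal sketch of this kind of intensity filter, assuming the peak intensities are available as a feature-by-sample matrix with `NA` where no peak was detected. The matrix `int`, its values and the helper `keep_features()` are hypothetical and are not part of the XCMS processing itself:

```r
# Hypothetical feature-by-sample matrix of peak intensities (NA = no peak).
int <- matrix(c(2e5, 3e5, 1.5e5, 4e5, 2e5, NA,   # 5 peaks reach the threshold
                2e5, 8e4, NA,    NA,  3e5, 1e5,  # only 3 peaks reach it
                1e6, 1e6, 1e6,   1e6, 1e6, 1e6), # all 6 peaks reach it
              nrow = 3, byrow = TRUE,
              dimnames = list(c("FT1", "FT2", "FT3"), paste0("S", 1:6)))

# Keep the features with at least `min_peaks` peaks whose intensity is
# greater than or equal to `min_intensity` (both values are parameters).
keep_features <- function(x, min_intensity, min_peaks = 5) {
  n_ok <- rowSums(x >= min_intensity, na.rm = TRUE)
  n_ok >= min_peaks
}

# NEG polarity threshold quoted in the text
keep_neg <- keep_features(int, min_intensity = 1e5)
int_neg  <- int[keep_neg, , drop = FALSE]
```

The threshold is passed as an argument so that the same helper can be reused for each polarity.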
I focus on features characteristic of a single sample type. To that end, I select the features that are detected in only 1 sample type and whose mean intensity in that sample type is more than 10^{8} times higher than the mean intensity in the other samples.
First of all, I load the data and select the features of interest:
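A minimal sketch of the selection rule described above, under some assumptions: `int` is a gap-filled feature-by-sample intensity matrix (so that the mean of the other samples is defined), `detected` flags whether a peak was actually detected, and `classes` gives the sample type of each column. All object names, the class labels and the values are hypothetical:

```r
# Placeholder sample types (two replicates each) and toy data.
classes <- c("A", "A", "B", "B", "C", "C")
int <- rbind(
  FT1 = c(4e9, 6e9, 10,  10,  10,  10),
  FT2 = c(5e6, 4e6, 3e6, 10,  10,  10),
  FT3 = c(1e3, 1e3, 1e3, 1e3, 1e6, 1e6)
)
detected <- rbind(
  FT1 = c(TRUE,  TRUE,  FALSE, FALSE, FALSE, FALSE),
  FT2 = c(TRUE,  TRUE,  TRUE,  FALSE, FALSE, FALSE),
  FT3 = c(FALSE, FALSE, FALSE, FALSE, TRUE,  TRUE)
)

select_specific <- function(int, detected, classes, min_ratio) {
  lv <- unique(classes)
  # TRUE when the feature was detected in at least one sample of the class
  det_by_class <- sapply(lv, function(cl)
    rowSums(detected[, classes == cl, drop = FALSE]) > 0)
  only_one <- rowSums(det_by_class) == 1
  # First class with a detection (meaningful only when only_one is TRUE)
  own <- lv[apply(det_by_class, 1, which.max)]
  # Mean intensity in the feature's own class over the mean in all other samples
  ratio <- vapply(seq_len(nrow(int)), function(i) {
    in_cl <- classes == own[i]
    mean(int[i, in_cl]) / mean(int[i, !in_cl])
  }, numeric(1))
  data.frame(feature = rownames(int), class = own, ratio = ratio,
             selected = unname(only_one & ratio > min_ratio))
}

res <- select_specific(int, detected, classes, min_ratio = 1e8)
```

In this toy example only `FT1` passes both conditions: `FT2` is detected in two sample types, and `FT3` is specific to one type but its intensity ratio is too low.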
Below, I plot the distribution of samples and features through a PCA, highlighting the selected features:
The figure shows the score (left) and loading (right) plots of the PCA for the positive (top row) and negative (bottom row) data.
Samples in the score plots are colored according to their class, and features in the loading plots are colored according to their specificity for a certain sample class (i.e. they are colored when they have only been detected in one type of sample).
The color intensity of a feature in the loading plots indicates whether its mean intensity in that class is at least 10^{8} times higher than its mean intensity in the samples of the other classes.
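A minimal sketch of how such score and loading plots could be produced with base R, for one polarity. The data are simulated, and the log transformation, centering and scaling are assumptions rather than the settings used for the figure; `x`, `classes` and `selected` are hypothetical names:

```r
set.seed(123)
# Simulated intensity matrix: 6 samples (rows) x 20 features (columns).
x <- matrix(rlnorm(6 * 20, meanlog = 12, sdlog = 1), nrow = 6,
            dimnames = list(paste0("S", 1:6), paste0("FT", 1:20)))
classes  <- factor(rep(c("A", "B", "C"), each = 2))  # placeholder sample types
selected <- seq_len(20) %in% c(3, 7, 15)             # placeholder features of interest

# PCA on log-transformed, centered and scaled intensities.
pca <- prcomp(log2(x), center = TRUE, scale. = TRUE)

op <- par(mfrow = c(1, 2))
# Score plot: one point per sample, colored by class.
plot(pca$x[, 1:2], col = as.integer(classes), pch = 16, main = "Scores")
# Loading plot: one point per feature, selected features highlighted.
plot(pca$rotation[, 1:2], col = ifelse(selected, "red", "grey70"),
     pch = 16, main = "Loadings")
par(op)
```

Here `pca$x` holds the sample scores and `pca$rotation` the feature loadings, so the same two objects feed the left and right panels.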
Now I apply the functions written for feature grouping and compound identification to this dataset, and then print the resulting table of identifications:
In summary, the 52 features that fulfilled the initial criteria were grouped into 28 compounds, 7 (25%) of which correspond to a known compound. Of the 52 filtered features, 10 (19%) belong to the identified compounds.
At this point, I want to check the ppm and RT deviations between the theoretical database values and the experimental data:
## $NEG
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  -2.160  -2.079  -1.997  -1.935  -1.822  -1.646
##
## $POS
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## -1.8531 -1.7842 -1.5365 -1.5346 -1.3966 -0.9907

## $NEG
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   14.59   17.62   20.65   18.63   20.65   20.65
##
## $POS
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   3.915  11.695  16.782  15.278  20.655  21.548
The ppm deviations range between -2.2 and -1.0, whereas the RT deviations range between 4 and 22 seconds.
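For reference, a minimal sketch of how per-polarity summaries like the ones above can be computed. The table `ids`, its column names and all values are made up for illustration, and the sign convention (experimental minus theoretical) is an assumption:

```r
# Hypothetical table of identified features (values are made up).
ids <- data.frame(
  polarity = c("NEG", "NEG", "POS", "POS"),
  mz_exp   = c(179.0557, 191.0193, 180.0865, 205.0970),
  mz_theo  = c(179.0561, 191.0197, 180.0868, 205.0972),
  rt_exp   = c(120, 245, 130, 300),   # seconds
  rt_theo  = c(100, 230, 125, 280)    # seconds
)

# Mass deviation in parts per million, relative to the theoretical m/z.
ppm_dev <- function(mz_exp, mz_theo) (mz_exp - mz_theo) / mz_theo * 1e6

ids$ppm    <- ppm_dev(ids$mz_exp, ids$mz_theo)
ids$rt_dev <- ids$rt_exp - ids$rt_theo

# One summary per polarity, returned as a named list ($NEG, $POS).
lapply(split(ids$ppm, ids$polarity), summary)
lapply(split(ids$rt_dev, ids$polarity), summary)
```

Splitting by polarity before calling `summary()` is what yields one block of quantiles per polarity.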
## Time difference of 1.564729 mins